First, you will use the same dataset you chose for the previous assignment (M04 Lesson 02, covering partition (k-means, PAM) and hierarchical clustering) from the UC Irvine Machine Learning Repository at https://archive.ics.uci.edu/ml/
How did you choose a model for EM? Evaluate the model performance.
Cluster some of your data using EM-based clustering, on the same data that you used for k-means, PAM, and hierarchical clustering. How do the clustering approaches compare on the same data?
In assignment M4L2, I chose the Breast Cancer Wisconsin (Prognostic) data set.
Loading the data:
#data_url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wpbc.data'
#cancer_data <- read.table(url(data_url), sep = ',')
cancer_data <- read.table('wpbc.data.txt', sep = ",")
names(cancer_data) <- c('ID number', 'Outcome','Time','radius_mean','texure_mean','perimeter_mean','area_mean','smoothness_mean','compactness_mean','concavity_mean','concave_points_mean','symmetry_mean','fractal_dimension_mean', 'radius_SE','texure_SE','perimeter_SE','area_SE','smoothness_SE','compactness_SE','concavity_SE','concave_points_SE','symmetry_SE','fractal_dimension_SE','radius_worst','texure_worst','perimeter_worst','area_worst','smoothness_worst','compactness_worst','concavity_worst','concave_points_worst','symmetry_worst','fractal_dimension_worst','tumor_size','lymph_node_status')
cancer <- data.frame(cancer_data[,4:13])
head(cancer)
##   radius_mean texure_mean perimeter_mean area_mean smoothness_mean
## 1       18.02       27.60         117.50    1013.0         0.09489
## 2       17.99       10.38         122.80    1001.0         0.11840
## 3       21.37       17.44         137.50    1373.0         0.08836
## 4       11.42       20.38          77.58     386.1         0.14250
## 5       20.29       14.34         135.10    1297.0         0.10030
## 6       12.75       15.29          84.60     502.7         0.11890
##   compactness_mean concavity_mean concave_points_mean symmetry_mean
## 1           0.1036         0.1086             0.07055        0.1865
## 2           0.2776         0.3001             0.14710        0.2419
## 3           0.1189         0.1255             0.08180        0.2333
## 4           0.2839         0.2414             0.10520        0.2597
## 5           0.1328         0.1980             0.10430        0.1809
## 6           0.1569         0.1664             0.07666        0.1995
##   fractal_dimension_mean
## 1                0.06333
## 2                0.07871
## 3                0.06010
## 4                0.09744
## 5                0.05883
## 6                0.07164
str(cancer)
## 'data.frame': 198 obs. of 10 variables:
## $ radius_mean : num 18 18 21.4 11.4 20.3 ...
## $ texure_mean : num 27.6 10.4 17.4 20.4 14.3 ...
## $ perimeter_mean : num 117.5 122.8 137.5 77.6 135.1 ...
## $ area_mean : num 1013 1001 1373 386 1297 ...
## $ smoothness_mean : num 0.0949 0.1184 0.0884 0.1425 0.1003 ...
## $ compactness_mean : num 0.104 0.278 0.119 0.284 0.133 ...
## $ concavity_mean : num 0.109 0.3 0.126 0.241 0.198 ...
## $ concave_points_mean : num 0.0706 0.1471 0.0818 0.1052 0.1043 ...
## $ symmetry_mean : num 0.186 0.242 0.233 0.26 0.181 ...
## $ fractal_dimension_mean: num 0.0633 0.0787 0.0601 0.0974 0.0588 ...
The breast cancer dataset has 35 columns: two are factors and the rest are numeric. Here I use columns 4-13 (the ten mean-value features) as the clustering data.
library("mclust")
## Package 'mclust' version 5.2
## Type 'citation("mclust")' for citing this R package in publications.
em_clust <- Mclust(cancer)
em_clust
## 'Mclust' model object:
## best model: ellipsoidal, equal shape and orientation (VEE) with 2 components
Following the tutorial on mages' blog, I use the fitdistrplus package to fit a normal distribution to each column.
library("fitdistrplus")
## Loading required package: MASS
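# Fit a normal distribution to each of the 10 columns by maximum likelihood
for (i in seq_len(ncol(cancer))) {
  fit <- fitdist(cancer[, i], distr = "norm", method = "mle", discrete = FALSE)
  cat("This is column", i, "\n")
  plot(fit)  # density, Q-Q, CDF, and P-P plots
}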
## This is column 1
## This is column 2
## This is column 3
## This is column 4
## This is column 5
## This is column 6
## This is column 7
## This is column 8
## This is column 9
## This is column 10
Plotting a fitdist result produces four graphs: the density plot, Q-Q plot, CDF, and P-P plot. Based on the Q-Q plots, the data in these 10 columns appear to be approximately normally distributed, so I choose a mixture model that is a mixture of Gaussians.
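Reading Q-Q plots by eye is subjective, so a numerical check can complement them. The sketch below runs base R's `shapiro.test` on each column of a small simulated data frame. The `demo` data and its column names are hypothetical stand-ins for `cancer`, so this illustrates the check and does not reproduce any result from the real data.

```r
# Simulated stand-in for the `cancer` data frame (hypothetical data,
# not the WPBC values): one normal column and one right-skewed column.
set.seed(42)
demo <- data.frame(
  normal_col = rnorm(198, mean = 17, sd = 3),
  skewed_col = rexp(198, rate = 1)
)

# Shapiro-Wilk test per column; a small p-value is evidence against normality.
p_values <- sapply(demo, function(x) shapiro.test(x)$p.value)
print(round(p_values, 4))

# Names of the columns where normality is rejected at the 5% level.
rejected <- names(p_values)[p_values < 0.05]
print(rejected)
```

On the real data the same call would be `sapply(cancer, function(x) shapiro.test(x)$p.value)`. A Gaussian mixture assumes normality within each component and not of the pooled data, so these per-column checks are only a rough guide.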
# Evaluate the model performance
summary(em_clust)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEE (ellipsoidal, equal shape and orientation) model with 2 components:
##
##  log.likelihood   n df      BIC      ICL
##        1290.914 198 77 2174.631 2158.937
##
## Clustering table:
##   1   2
##  50 148
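The BIC in this summary can be reproduced from the other reported quantities. mclust uses the convention BIC = 2 * logLik - df * log(n), so larger values are better (the opposite sign of the more common definition). The sketch below recomputes it from the numbers printed above and also counts the free parameters of a two-component VEE model in 10 dimensions. The variable names are my own and only illustrative.

```r
# Values copied from the summary output above.
loglik <- 1290.914
n      <- 198   # observations
d      <- 10    # variables used for clustering
G      <- 2     # mixture components

# Free parameters of a VEE model:
#   mixing proportions:  G - 1
#   component means:     G * d
#   volume (varies):     G
#   shape (equal):       d - 1
#   orientation (equal): d * (d - 1) / 2
df <- (G - 1) + G * d + G + (d - 1) + d * (d - 1) / 2
print(df)

# mclust's BIC convention: 2 * logLik - df * log(n); larger is better.
bic <- 2 * loglik - df * log(n)
print(round(bic, 3))
```

Both values agree with the summary: df = 77 and BIC = 2174.631.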
# BIC
plot(em_clust, what = "BIC")
# Classification
plot(em_clust, what = "classification")
# Uncertainty
plot(em_clust, what = "uncertainty")
# Density
plot(em_clust, what = "density")
# ICL
ICL <- mclustICL(cancer)
summary(ICL)
## Best ICL values:
##               VEE,2      VEE,6      VEE,3
## ICL        2158.937 2122.20671 2120.16562
## ICL diff      0.000  -36.73064  -38.77172
plot(ICL)
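ICL penalizes BIC by the uncertainty of the cluster assignments. As I understand mclust's definition, it is BIC plus twice the sum, over observations, of the log of the largest posterior membership probability, a term that is never positive. The sketch below applies that formula to a tiny hypothetical posterior matrix `z` (made-up values, not taken from `em_clust`), then prints the gap between ICL and BIC implied by the summary above.

```r
# Hypothetical posterior membership probabilities for 4 observations
# and 2 components (rows sum to 1); made-up values for illustration.
z <- matrix(c(0.99, 0.01,
              0.90, 0.10,
              0.20, 0.80,
              0.55, 0.45),
            ncol = 2, byrow = TRUE)

# Penalty: 2 * sum of log(largest posterior) per observation.
# It is 0 for perfectly certain assignments and negative otherwise.
penalty <- 2 * sum(log(apply(z, 1, max)))
print(penalty)

# ICL = BIC + penalty, using an arbitrary BIC value for the toy example.
toy_bic <- 100
toy_icl <- toy_bic + penalty
print(toy_icl)

# For the fitted model above, the implied penalty is ICL - BIC.
print(2158.937 - 2174.631)
```

With the fitted object, the posterior matrix would be `em_clust$z`. The implied penalty of about -15.7 across 198 observations is small, which is consistent with most observations being assigned to their cluster with high certainty.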